ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data
Summary
- We present an R package, ggtree, which provides programmable visualization and annotation of phylogenetic trees.
- ggtree can read more tree file formats than other software, including newick, nexus, NHX, phylip and jplace formats, and supports visualization of phylo, multiphylo, phylo4, phylo4d, obkdata and phyloseq tree objects defined in other R packages. It can also extract the tree/branch/node-specific and other data from the analysis outputs of beast, epa, hyphy, paml, phylodog, pplacer, r8s, raxml and revbayes software, and allows using these data to annotate the tree.
- The package allows colouring and annotation of a tree by numerical/categorical node attributes, manipulating a tree by rotating, collapsing and zooming out clades, highlighting user selected clades or operational taxonomic units and exploration of a large tree by zooming into a selected portion.
- A two-dimensional tree can be drawn by scaling the tree width based on an attribute of the nodes. A tree can be annotated with an associated numerical matrix (as a heat map), multiple sequence alignment, subplots or silhouette images.
- The package ggtree is released under the artistic-2.0 license. The source code and documents are freely available through bioconductor (http://www.bioconductor.org/packages/ggtree).
Introduction
Phylogenetic trees are commonly used to present the evolutionary relationships of species. There are many software packages and Web tools that are designed for displaying phylogenetic trees, such as treeview (Page 1996), figtree (Rambaut 2014), treedyn (Chevenet et al. 2006), itol (Letunic & Bork 2011), evolview (Zhang et al. 2012) and dendroscope (Huson & Scornavacca 2012). Only a subset, such as figtree, treedyn and itol, allows users to annotate trees with coloured branches, highlighted clades and tree features. However, their pre-defined annotating functions are usually limited to some specific evolutionary data and are not readily programmable within the same program platform. As phylogenetic trees become more widely used in multidisciplinary studies, there is an increasing need to incorporate various types of their covariates and other associated data from different sources into the trees for visualizations and further analyses. Users then require programmable software to allow high levels of customization and data integration over the trees, in addition to standalone applications that focus on specific analyses and data types.
To fill this gap, we developed ggtree, a package for the R programming language (R Core Team, 2015) released under the bioconductor project (Gentleman et al. 2004). ggtree is built on the merits of ggplot2 (Wickham 2009), which is based on the grammar of graphics (Wilkinson 2005). Unlike most of the other phylogenetic software and R packages that only read tree files in newick and/or nexus formats, ggtree supports more formats including NHX (New Hampshire eXtended format), jplace and phylip. It also allows the evolutionary data to be parsed from the non-standard formatted output data of different software (Table 1) into annotations on a tree. This enables diverse types of annotations to be combined, visualized and further processed on the same tree topology, where new patterns or correlations of evolutionary processes could be more easily identified.
| Programs | Data that can be parsed |
|---|---|
| ape (R package) | Bootstrap values |
| beast | Any information (e.g. substitution rates, node ages, geographic states) stored in the node attributes in the nexus-formatted tree file |
| paml-baseml | Ancestral sequences (from rst file) |
| paml-codeml | Ancestral sequences (from rst file); dN, dS and ω estimates (from mlc file) |
| hyphy | Ancestral sequences (from the nexus-formatted tree file) |
| phangorn (R package) | Ancestral sequences |
| raxml | Branch support values |
| r8s | Tree with branch in unit of time, rate and absolute substitution |
| pplacer | Taxon placement information from jplace-formatted tree file |
| epa | Taxon placement information from jplace-formatted tree file |
| phylodog | Any information from the NHX-formatted tree file |
| revbayes | Any information from the NHX-formatted tree file |
The R language is increasingly being used in phylogenetics. However, a comprehensive package, designed for viewing and annotating phylogenetic trees, particularly with data integration, is not yet available. Most of the R packages in phylogenetics focus on specific statistical analyses rather than viewing and annotating the trees with more generalized tree-associated data. Some packages, including ape (Paradis, Claude & Strimmer 2004) and phytools (Revell 2012), which are capable of displaying and annotating trees, are developed using the base graphics system of R. outbreaktools (Jombart et al. 2014) and phyloseq (McMurdie & Holmes 2013) extended ggplot2 to draw phylogenetic trees. The ggplot2 system of graphics allows rapid customization and exploration of design solutions. However, these packages were designed for epidemiology and microbiome data, respectively, and did not aim to provide a general solution for tree visualization and annotation (Appendix S1, Supporting Information). The ggtree package inherits versatile properties of ggplot2 and thus allows constructing complex tree views by freely combining multiple layers of annotations from different sources of tree-associated data.
Description
The ggtree package
The ggtree package is designed for annotating phylogenetic trees with their associated data of different types and from various sources. These data could come from users or analysis programs and might include evolutionary rates, ancestral sequences, etc., that are associated with the taxa from real samples, or with the internal nodes representing hypothetical ancestral strains/species, or with the tree branches indicating evolutionary time courses. For instance, the data could be the geographic positions of the sampled avian influenza viruses (informed by the survey locations) and the ancestral nodes (by phylogeographic inference) in the viral gene tree (Lam et al. 2012).
ggtree supports the graphical language of ggplot2, with which a high level of customization can be achieved intuitively and flexibly. However, ggplot2 itself does not provide low-level geometric objects or other support for tree-like structures. Even though outbreaktools and phyloseq are developed based on ggplot2, the most valuable part of ggplot2 syntax, adding layers of annotations, is not supported in these packages. For example, if we have plotted a tree without taxa labels, outbreaktools and phyloseq provide no easy way for general R users, who have little knowledge about the infrastructures of these packages, to add a layer of taxa labels.
ggtree extends ggplot2 to support tree objects and implements a geometric layer, geom_tree, to support visualizing tree structure. In ggtree, viewing a phylogenetic tree is relatively easy, via the command ‘ggplot(tree_object) + geom_tree() + theme_tree()’ or ‘ggtree(tree_object)’ for short. Layers of annotations can be added one by one via the ‘+’ operator. To facilitate tree visualization, ggtree provides several geometric layers, including geom_treescale for adding a legend of tree scale (genetic distance, divergence time, etc.), geom_range for displaying uncertainty of branch lengths (confidence interval or range, etc.), geom_tiplab for adding taxa labels, geom_tippoint and geom_nodepoint for adding symbols of tips and internal nodes, geom_hilight for highlighting a clade with a rectangle and geom_cladelabel for annotating a selected clade with a bar and text label (Table 2).
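As a minimal sketch of the layered syntax described above, the following builds a tree view and adds annotation layers one at a time. The random tree from ape::rtree is a stand-in for real data, and the object names (tree, p_long, p_short, p) are hypothetical; function availability may differ between ggtree versions.

```r
library(ape)      # rtree() simulates a random tree (example data only)
library(ggplot2)
library(ggtree)

set.seed(42)
tree <- rtree(10)   # hypothetical 10-tip tree standing in for real data

# Long form and its shorthand
p_long  <- ggplot(tree, aes(x, y)) + geom_tree() + theme_tree()
p_short <- ggtree(tree)

# Annotation layers are added one by one with `+`
p <- p_short +
  geom_tiplab(size = 3) +                  # taxa labels
  geom_tippoint(colour = "steelblue") +    # symbols at the tips
  geom_nodepoint(alpha = 0.5) +            # symbols at internal nodes
  geom_treescale()                         # scale bar for branch lengths
```

Because each layer is an independent object, any of them can be dropped or reordered without touching the rest of the expression.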
| Function | Description |
|---|---|
| as.binary | Convert a multifurcating tree to a binary tree by resolving the polytomy with zero branch lengths. |
| MRCA | Find most recent common ancestor of two or more tips |
| read.paml_rst | Parse an ‘rst’ file from paml, which is then stored in a paml_rst object; outputs from baseml and codeml are supported |
| read.baseml | Parse the output of baseml, which is then stored in a baseml object |
| read.codeml_mlc | Parse a ‘mlc’ file from codeml which is then stored in a codeml_mlc object |
| read.codeml | Parse the output from codeml, which is then stored in a codeml object |
| read.hyphy | Parse the output from hyphy, which is then stored in a hyphy object |
| read.beast | Parse the output from beast, which is then stored in a beast object |
| read.raxml | Parse the output from raxml, which is then stored in a raxml object |
| read.r8s | Parse the output from r8s, which is then stored in an r8s object |
| read.jplace | Parse a jplace file into a jplace object. Outputs from epa, pplacer and ggtree are supported |
| read.nhx | Parse a NHX file into nhx object. Outputs from phylodog and revbayes are supported |
| read.phylip | Parse PHYLIP tree file. |
| apeBoot | Integrate a phylo object with bootstrap values from ape::boot.phylo; the result is stored in an apeBootstrap object |
| phyPML | Parse output from phangorn::optim.pml, infer ancestral sequences and store the result in a phangorn object |
| get.fields | List the annotation attributes stored in a tree object |
| get.treetext | Extract the newick tree string from tree objects |
| get.tree | Extract the phylo object (tree representation) from a tree object |
| get.tipseqs | Extract tip sequences from baseml, codeml or hyphy objects |
| get.subs | Extract nucleotide or amino acid substitutions along the tree from a baseml, codeml or hyphy object |
| get.placement | Extract placement information from a jplace object parsed from the output of epa or pplacer |
| get.phylopic | Download a silhouette image from the PhyloPic data base |
| plot | Plot methods for quickly viewing the annotation data of all types of tree objects defined in ggtree |
| ggtree | Construct a tree view from a tree object. Supported layouts are rectangular, slanted, circular, fan, unrooted and two-dimensional tree |
| geom_tree | Layer to support drawing a tree view with ggplot2 |
| geom_cladelabel | Layer to annotate clade with bar and text label |
| geom_range | Layer to annotate uncertainty of branch lengths |
| geom_hilight | Layer to highlight selected clade with rectangle |
| geom_tiplab | Layer to add labels to tree tips |
| geom_tippoint | Layer to add symbols to tree tips |
| geom_nodepoint | Layer to add symbols to internal nodes |
| geom_rootpoint | Layer to add symbols to root node |
| geom_treescale | Layer to add tree scale (e.g. substitution rate) |
| geom_text2 | Modified version of geom_text with subset supported |
| geom_point2 | Modified version of geom_point with subset supported |
| geom_segment2 | Modified version of geom_segment with subset supported |
| theme_tree | Blank theme |
| theme_tree2 | Blank theme with evolutionary distance as the x-axis |
| theme_transparent | Background transparent theme |
| theme_inset | Blank theme with background transparent |
| scale_color | Define colours based on the numerical values (scale) of attributes associated with a tree. These can then be used in colouring a tree or annotation data |
| collapse | Collapse a selected clade |
| expand | Expand a collapsed clade |
| scaleClade | Zoom in or zoom out a selected clade |
| flip | Exchange positions of two clades that share a same parent node |
| rotate | Rotate a selected clade |
| groupOTU | Group selected OTUs by tracing back to their most recent common ancestor |
| groupClade | Group a selected clade or list of clades |
| gzoom | Zoom a selected portion of a very large tree |
| viewClade | Visualize a clade of a tree |
| gheatmap | Visualize a tree with an associated matrix displayed next to the tree as a heatmap |
| subview | Embed subplot |
| inset | Annotate nodes with subplots |
| nodebar | Create a list of bar charts for nodes |
| nodepie | Create a list of pie charts for nodes |
| phylopic | Annotate a tree with a silhouette image downloaded from the PhyloPic data base |
| mask | Mask all genetic substitutions on the tree branches, except for those specified. |
| msaplot | Visualize a tree with a multiple sequence alignment displayed next to the tree |
| open_tree | Convert circular layout tree to fan layout |
| rescale_tree | Rescale branch length |
| rotate_tree | Rotate tree by specific angle |
| %<% | Update a tree view with another tree object |
| %<+% | Append user-specific annotation data to an existing tree view. These data can be used for annotating the tree |
| write.jplace | Output a jplace file of a tree with user-specified data. It can be used to store a tree with user's own annotation data. The output will be able to be parsed by read.jplace and is fully supported in ggtree |
File formats and S4 classes
In ggtree, the S4 class defines a compound tree-based object that contains the tree and other information associated with the tree, branches or nodes. ggtree can read a number of tree file formats, including newick and nexus (via ape), NHX, jplace (Matsen et al. 2012) and phylip, into an S4 tree object. Non-standard analysis output files from various evolutionary biology software packages including beast (Bouckaert et al. 2014), epa (Berger, Krompass & Stamatakis 2011), hyphy (Pond, Frost & Muse 2005), paml (Yang 2007), phylodog (Bastien et al. 2013), pplacer (Matsen, Kodner & Armbrust 2010), raxml (Stamatakis 2014), revbayes (Sebastian et al. 2014) and r8s (Sanderson 2003) (Table 1) can also be parsed into S4 objects using functions read.beast, read.codeml_mlc, read.codeml, read.hyphy, read.jplace, read.nhx, read.paml_rst, read.phylip, read.raxml and read.r8s (Fig. 1, Table 2). After parsing, some node/branch-specific attribute data (e.g. evolutionary rates, ancestral/taxon sequences) are extracted from the files and stored in the S4 tree object. An overview of S4 classes and corresponding parser functions is illustrated in Fig. 1.
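For the standard formats, parsing relies on ape. The sketch below only illustrates what a parsed newick tree looks like as a phylo object; the newick string and its support values are invented for the example and are not taken from the data set used in this paper.

```r
library(ape)

# Hypothetical newick string with branch lengths and node support labels
newick <- "((A:0.1,B:0.2)90:0.3,(C:0.4,D:0.5)75:0.6);"
tree <- read.tree(text = newick)

class(tree)       # a phylo object, the tree representation shared by many R packages
Ntip(tree)        # number of tips
tree$tip.label    # taxon names
tree$node.label   # internal node labels, here the support values stored as text
```

The ggtree parsers for analysis outputs wrap a tree of this kind together with the node/branch-specific attributes extracted from the same file.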

Furthermore, ggtree provides a function, merge_tree, to combine two trees together with their node/branch-specific attribute data. Essentially, as a result, one such attribute (e.g. evolutionary rate) can be mapped to another attribute (e.g. dN/dS) of the same branch/node for comparison and further computations (Fig. 2). ggtree can also directly visualize and annotate phylo, multiphylo, phylo4, phylo4d, obkdata and phyloseq tree-related objects that are defined in other R packages. The tree object in ggtree can also be converted, via get.tree(), to phylo or multiphylo objects that are widely used in other R packages. In addition, ggtree provides a fortify method to convert the tree object to a tidy data frame, which is familiar to R users and easy to manipulate. Therefore, ggtree represents an infrastructure that enables phylogeny/taxon-related data inferred from different external computer programs or R packages to be unified and analysed in R.
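The tidy data frame representation can be sketched as follows. The small newick tree and the rates table are hypothetical, and the join with base merge() only illustrates the idea of aligning two attribute sets on the same node; it is not the implementation of merge_tree.

```r
library(ape)
library(ggplot2)
library(ggtree)   # provides the fortify method for tree objects

tree <- read.tree(text = "((A:1,B:2):1,(C:1,D:3):2);")

# One row per node (tips and internal nodes), with plotting coordinates
df <- as.data.frame(fortify(tree))
head(df[, c("parent", "node", "label", "isTip", "x", "y")])

# Hypothetical node-specific attribute, keyed by node number
rates <- data.frame(node = df$node,
                    rate = seq(0.001, by = 0.001, length.out = nrow(df)))

# Attributes that share a node identifier can be compared row by row
annotated <- merge(df, rates, by = "node")
```

Once the data are in this form, any data frame tool in R can be used for filtering, summarizing or plotting the node attributes.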

Example 1: parsing tree and analysis output files
To illustrate the utilities of ggtree, we used a previously published data set: 76 H3 hemagglutinin gene sequences of a lineage containing swine and human influenza A viruses (Liang et al. 2014). The data set was re-analysed by beast for timescale estimation and codeml for synonymous and non-synonymous substitutions estimation. In this example, we first parsed the outputs from beast using read.beast and from codeml using read.codeml into two tree objects. Then, the two objects containing two sets of branch/node-specific data were merged via the merge_tree function.

    library(ggtree)
    beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
    rst_file <- system.file("examples/rst", package="ggtree")
    mlc_file <- system.file("examples/mlc", package="ggtree")
    beast_tree <- read.beast(beast_file)
    codeml_tree <- read.codeml(rst_file, mlc_file)
    merged_tree <- merge_tree(beast_tree, codeml_tree)
    get.fields(merged_tree)
    ## [1] "height"          "height_0.95_HPD" "height_median"
    ## [4] "height_range"    "length"          "length_0.95_HPD"
    ## [7] "length_median"   "length_range"    "posterior"
    ## [10] "rate"           "rate_0.95_HPD"   "rate_median"
    ## [13] "rate_range"     "t"               "N"
    ## [16] "S"              "dN_vs_dS"        "dN"
    ## [19] "dS"             "N_x_dN"          "S_x_dS"
    ## [22] "marginal_subs"  "joint_subs"      "marginal_AA_subs"
    ## [25] "joint_AA_subs"

After merging the beast_tree and codeml_tree objects, all branch/node-specific data inferred by beast and codeml are available in the merged_tree object, in the components [1-13] and [14-25] of the vector above. We further converted the tree object to a data frame, df, and visualized hexbin scatter plots of dN/dS, dN and dS inferred by codeml vs. rate inferred by beast on the same branches.

    df <- fortify(merged_tree)
    df <- df[, c("dN_vs_dS", "dN", "dS", "rate")]
    df <- na.omit(df)
    df <- df[df$dN_vs_dS >= 0 & df$dN_vs_dS <= 1.5, ] %>%
        tidyr::gather(type, value, dN_vs_dS:dS)
    df$type[df$type == "dN_vs_dS"] <- "dN/dS"
    ## fix the panel order by storing it as factor levels
    df$type <- factor(df$type, levels=c("dN/dS", "dN", "dS"))
    ## geom_hex() requires the hexbin package
    ggplot(df, aes(rate, value)) + geom_hex() +
        facet_wrap(~ type, scales="free_y")

The output is illustrated in Fig. 2. We can then test the association of these branch/node-specific data using Pearson correlation, which in this case showed that dN and dS, but not dN/dS, are significantly associated with rate.
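A Pearson correlation test of this kind can be run with cor.test from base R. The sketch below uses simulated values in place of the real branch-specific estimates, so the column values, sample size and the resulting statistics are illustrative assumptions only and do not reproduce the result reported above.

```r
set.seed(1)
n <- 150   # hypothetical number of branches

# Simulated stand-ins for branch-specific estimates
sim <- data.frame(rate = runif(n, 0.001, 0.005))
sim$dN <- 2 * sim$rate + rnorm(n, sd = 0.0005)   # constructed to track rate
sim$dS <- 5 * sim$rate + rnorm(n, sd = 0.0010)   # constructed to track rate
sim$dN_vs_dS <- runif(n, 0, 1.5)                 # constructed independently of rate

# Pearson correlation of each codeml-type estimate against rate
results <- sapply(c("dN", "dS", "dN_vs_dS"), function(v) {
  ct <- cor.test(sim$rate, sim[[v]], method = "pearson")
  c(r = unname(ct$estimate), p = ct$p.value)
})
round(results, 4)
```

With real data, sim would be replaced by the data frame of branch attributes obtained from the merged tree object.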
Example 2: phylogenetic tree visualization and annotation
The following example turns the merged_tree into a graphic object with tree branches coloured by branch-specific substitution rates (rate) as shown in Fig. 3a.


    p <- ggtree(merged_tree, aes(color=rate)) +
        theme_tree2() +
        scale_color_continuous(high='#D55E00', low='#0072B2') +
        geom_tiplab(size=2)

Other branch/node-specific data stored in the tree object (Fig. 5) can be displayed as an additional graphic layer of annotation on top of a tree. Complex presentations of trees are made possible by adding multiple layers of annotations. A phylogenetic tree can be rescaled using any numerical variable associated with branches. For instance, branch-specific estimates of dN, dS and ω from codeml analysis can be used as lengths and colours of the branches in the tree (Fig. 3b). Tree nodes can be given different symbols based on the categorical values associated (Fig. 3c). ggtree can display a tree in different layouts, including rectangular, slanted, circular and fan layouts for phylogram and cladogram, rooted/unrooted, timescaled and two-dimensional phylogenies.
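Switching between layouts can be sketched as below, assuming the layout and branch.length arguments of ggtree(). The random tree and the object names (views, cladogram) are hypothetical.

```r
library(ape)
library(ggtree)

set.seed(7)
tree <- rtree(15)   # hypothetical example tree

# The same tree drawn in several layouts (phylograms)
layouts <- c("rectangular", "slanted", "circular", "fan")
views <- lapply(layouts, function(l) ggtree(tree, layout = l))
names(views) <- layouts

# Ignoring branch lengths gives a cladogram
cladogram <- ggtree(tree, branch.length = "none")
```

Each element of views is an ordinary plot object, so the same annotation layers can be added to any of them.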
Compared to other phylogenetic tree visualizing packages, ggtree excels at visual exploration of tree structure and related data. For example, a complex tree view with several annotation layers can be transferred to a new tree object without step-by-step re-creation. We have created an operator, %<%, to update a tree view with a new tree object. The following example rescales the branch lengths of the tree (merged_tree) with the branch-specific dN values and updates the graphic object (p) with this new tree via %<%. The branch colours of this updated tree view were re-mapped from ‘rate’ to ‘dN’.

    p %<% rescale_tree(merged_tree, 'dN') + aes(color=dN)

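The behaviour of %<% can be sketched with two unrelated random trees: the annotation layers defined for the first tree are reused for the second. The trees and object names are hypothetical, and this assumes the operator is available in the installed ggtree version.

```r
library(ape)
library(ggtree)

set.seed(11)
tree1 <- rtree(8)   # hypothetical tree used to design the view
tree2 <- rtree(8)   # hypothetical tree that receives the same view

# A tree view with two annotation layers
p <- ggtree(tree1) + geom_tiplab() + geom_tippoint()

# Re-use the layers of p for another tree
p2 <- p %<% tree2
```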
The groupClade function assigns the branches and nodes under different clades into different groups. Similarly, the groupOTU function assigns branches and nodes to different groups based on user-specified groups of operational taxonomic units (OTUs) that are not necessarily within a clade, but can be monophyletic (clade), polyphyletic or paraphyletic. A phylogenetic tree can be annotated by mapping different line types, sizes, colours or shapes to the branches or nodes that have been assigned to different groups. In the following example (Fig. 3c), we assigned branches and nodes to different groups based on the host species of the taxa via groupOTU(). According to the groupings, branches were then given different colours and line types, and the taxa were given symbols with different colours and shapes. We also applied the timescale, in Gregorian calendar, to the branch lengths by setting the most recent sampling date (mrsd).

    tip <- get.tree(merged_tree)$tip.label
    merged_tree <- groupOTU(merged_tree, tip[grep("Swine", tip)], "host")
    ggtree(merged_tree, aes(color=host, linetype=host), mrsd="2013-01-01") +
        geom_tippoint(aes(shape=host)) + theme_tree2()

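The same grouping can be tried on a plain phylo object without the influenza data. In this sketch the tip labels are invented, and the group_name argument is assumed to be available as in recent ggtree releases; in the example above the group name is passed by position instead.

```r
library(ape)
library(ggtree)

set.seed(3)
tree <- rtree(12)
# Hypothetical tip labels carrying the host name as a prefix
tree$tip.label <- paste0(rep(c("Swine", "Human"), each = 6), "_", 1:12)

# Group branches and nodes by tracing back from the swine tips
swine <- grep("Swine", tree$tip.label, value = TRUE)
tree <- groupOTU(tree, swine, group_name = "host")

# Map the grouping to colour, line type and tip symbol shape
p <- ggtree(tree, aes(color = host, linetype = host)) +
  geom_tippoint(aes(shape = host)) +
  theme_tree2()
```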
To facilitate viewing and manipulating a phylogenetic tree, ggtree provides a number of helper functions. For example, the gzoom or viewClade functions allow the user to zoom into a selected portion or display a selected clade respectively. Other common tree manipulations could be achieved by collapse, expand, rotate, flip, etc., functions. A list of the major ggtree functions is given in Table 2, and their detailed explanations and examples are provided in the online vignette.
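A few of these helper functions are sketched below on a small invented tree. The clade is selected with ape::getMRCA, and the ggtree:: prefix is used only to avoid name clashes with other attached packages; both choices are assumptions of this sketch rather than requirements.

```r
library(ape)
library(ggtree)

# Hypothetical balanced tree with eight tips
tree <- read.tree(
  text = "(((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);")
p <- ggtree(tree) + geom_tiplab()

# Internal node that is the most recent common ancestor of A and B
node <- getMRCA(tree, c("A", "B"))

p_collapsed <- ggtree::collapse(p, node = node)            # hide the clade
p_expanded  <- ggtree::expand(p_collapsed, node = node)    # show it again
p_clade     <- viewClade(p, node = getMRCA(tree, c("A", "D")))  # display one clade only
```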
Example 3: two-dimensional trees
The y-axis or width of a conventionally laid out tree (i.e. with tree branches spanning horizontally along the x-axis, as shown in Fig. 3) often provides only regular spatial separation to the tree branches, without quantitative biological meanings. ggtree can draw ‘two-dimensional’ trees by rescaling the y-axis/tree width to a node-specific numerical attribute that might be a measure of certain biological characteristics of the taxa and hypothetical ancestors in the tree. In this example, we used the previous timescaled tree object and aimed to scale its y-axis/tree width based on the predicted N-linked glycosylation sites (NLG) for each of the taxon and ancestral sequences. The NLG sites were predicted using the netnglyc 1.0 Server (Blom et al. 2004) and were read into R and stored in the NAG variable (Appendix S2). To scale the y-axis, the parameter yscale in the ggtree() function is set to a numerical or categorical variable. If yscale is a categorical variable, users should specify how the categories are to be mapped to numerical values via the yscale_mapping variable as demonstrated in this case. The resultant two-dimensional tree is shown in Fig. 4.


    ggtree(merged_tree, aes(color=host), mrsd="2013-01-01",
           yscale="label", yscale_mapping=NAG)

Example 4: more complex tree annotations
In this example, we demonstrate more complex tree annotation with additional texts and shapes (Fig. 5). We first visualized the tree in timescale and branches coloured by dN/dS. The tree was annotated with clade probabilities and the amino acid substitutions. The substitutions were determined via parent–child sequence comparison from the taxon sequences and ancestral sequences that can be estimated by any of hyphy, baseml or codeml.

While ggtree supports tree annotation using data from a list of software (Table 1), it also easily accepts user-defined data. In ggtree, the operator %<+% has been defined to allow user-defined annotation data (host.df in this example) to attach to a tree graphic object. In this example, we attached the host species information to the tree view and coloured the circle symbols and labels of taxa based on this information.
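The attachment of user data via %<+% can be sketched on a toy tree. The tree, the host.df contents and the column name taxa are invented for illustration; the first column of the data frame is assumed to hold the tip labels used for matching.

```r
library(ape)
library(ggtree)

tree <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")

# Hypothetical annotation table; first column matches the tip labels
host.df <- data.frame(taxa = c("A", "B", "C", "D"),
                      host = c("Swine", "Swine", "Human", "Human"))

# Attach the table, then use its columns in later layers
p <- ggtree(tree) %<+% host.df +
  geom_tippoint(aes(color = host), size = 2) +
  geom_tiplab(aes(color = host), offset = 0.1)

# The attached column is now part of the data behind the tree view
host_of_A <- as.character(p$data$host[which(p$data$label == "A")])
```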
Users may have a matrix of data (from experiments or data analysis) about the taxa in the phylogenetic tree. In ggtree, this data matrix can be displayed as a heat map aligning with the corresponding taxa at the right side of the tree by gheatmap function. Here, we annotated the tree with a heat map of the genotypes for each taxon (Fig. 5). In the genotype matrix, the colour of each of the eight boxes indicates the lineage of each gene segment of the viruses that was classified according to Lam et al. (2011) and Liang et al. (2014).
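A heat map of this kind can be sketched with a small invented matrix. The tree, the genotype values and the column names seg1 to seg3 are hypothetical; the row names of the matrix are assumed to match the tip labels.

```r
library(ape)
library(ggtree)

tree <- read.tree(text = "((A:1,B:1):1,(C:1,D:1):1);")

# Hypothetical genotype table: one row per taxon, one column per gene segment
genotype <- data.frame(seg1 = c("x", "x", "y", "y"),
                       seg2 = c("x", "y", "y", "y"),
                       seg3 = c("z", "z", "x", "y"),
                       row.names = tree$tip.label)

p <- ggtree(tree) + geom_tiplab()

# Draw the matrix to the right of the tree, aligned with the tips
h <- gheatmap(p, genotype, offset = 0.5, width = 0.4, colnames = FALSE)
```

The offset and width values control the gap between tree and heat map and the relative width of the heat map; they usually need tuning for a given tree.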
ggtree provides the subview function to add a subplot as a new layer of the main plot. In this example, the tree with the associated matrix was condensed into rectangular and fan shapes, which were plotted as subplots inside Fig. 5.

    ## Below is a code excerpt; see Appendix S2 for details
    ## visualize a tree with branches in timescale and
    ## coloured by dN/dS.
    cols <- scale_color(merged_tree, "dN_vs_dS",
                        low="#0072B2", high="#D55E00",
                        interval=seq(0, 1.5, length.out=100))
    p <- ggtree(merged_tree, size=.8, mrsd="2013-01-01",
                ndigits=2, color=cols)
    ## add annotation of amino acid substitution
    ## inferred by joint probabilities
    p <- p + geom_text(aes(x=branch, label=joint_AA_subs),
                       vjust=-.03, size=1.8)
    ## use %<+% operator to attach the host information to
    ## the tree view
    p <- p %<+% host.df
    ## after the attachment via %<+% operator,
    ## we can use host information to colour circles and
    ## labels of tips.
    p <- p + geom_tippoint(aes(color=host), size=2) +
        geom_tiplab(aes(color=host), align=TRUE, size=3, linesize=.3)
    ## visualize genotype heatmap
    gheatmap(p, genotype, width=.4, offset=7,
             colnames=FALSE) %>% scale_x_ggtree

In addition to the heat map display of taxa-associated matrix data, the underlying multiple sequence alignment of the taxa could be displayed with the tree using the msaplot function. Furthermore, trees can also be annotated with subplots of different types of graphs (e.g. bar, pie, box plot) using the inset function or with silhouette images taken from the PhyloPic data base (http://phylopic.org) with the phylopic function.
Conclusions
The ggtree package features (i) high interoperability, as ggtree can import evolutionary data from different tree file formats and analysis programs as well as other associated data from experiments, so that various sources and types of data can be displayed on a tree for comparison and further analyses; (ii) complex phylogenetic presentations, such as two-dimensional trees and graph/image-associated trees; (iii) a highly flexible graphic system, as ggtree extends ggplot2 and allows creating separate geometric layers that can be freely combined, removed and rearranged to support diverse but convenient ways of tree manipulation and visualization. ggtree also supports visualization of tree objects defined by other R packages so that ggtree can be easily integrated into their analysis/packages. For example, phyloseq tree objects and microbiome data can be visualized using ggtree (Appendix S1). With the help of ggtree, users can easily create large phylogenetic trees with complex annotations by integrating various associated data including temporal, spatial and genotypic information, such as those trees created in Liang et al. (2014) and Lam et al. (2015).
Acknowledgements
We gratefully thank the Editor and three anonymous reviewers for their useful suggestions and comments that have significantly improved this manuscript. This research was supported by Seed Funding Programme for Basic Research, HKU (201411159214), Theme-based Research Scheme (T11-705/14-N) and Area of Excellence Scheme grant (AoE/M-12/06) from University Grants Committee of the HKSAR. This research is conducted in part using the research computing facilities (HPC2015) and advisory services offered by Information Technology Services, HKU. The authors declared no conflict of interest to the publication of this work.
Author contributions
G.Y. and T.T.-Y.L. conceived and developed the methods and R package; G.Y., D.K.S. and T.T.-Y.L. wrote the manuscript; G.Y., D.K.S., H.Z., Y.G. and T.T.-Y.L. contributed to the final version of the manuscript.
Data accessibility
Example data deposited in the Dryad repository: http://datadryad.org/resource/doi:10.5061/dryad.v15v0 (Yu et al. 2016).